A SUGGESTED USE OF SEQUENTIAL ANALYSIS IN
PERFORMANCE ACCEPTANCE TESTING


Prepared  for 

Personnel  Analysis  Division 
Bureau  of  Naval  Personnel 
Contract  N6ori  - 0711*2 




by 

Rupert  N.  Evans,  Ph.D. 
College  of  Education 
University  of  Illinois 
Urbana, Illinois


TABLE OF CONTENTS


Chapter I. Introduction
    Acceptance Testing of Supplies in Industry and the Armed Forces
    Probable Limitations of Sequential Sampling in Testing People
    Advantages of Sequential Sampling in Testing People
    Summary

Chapter II. Sampling in Testing
    The Operating Characteristics Curve
    Summary

Chapter III. Choosing a Sequential Sampling Plan
    Information Needed for Choosing a Sequential Sampling Plan
    Acceptable and Unacceptable Persons
    Probability of Acceptance
    Means of Reporting Scores
    Item Intercorrelation and Difficulty
    The Average Sample Size Curve
    Summary

Chapter IV. Tentative Suggestions for Putting a Sequential Sampling Plan Into Operation
    Tentative Standardization of Test Items
    Computation of A and B
    Scoring Performance Items During Routine Test Administration
    Example of Procedure in Sequential Analysis
    Application of Ds Values
    Suggested Modification of Scoring Performance Items
    Estimating an Operating Characteristics Curve
    Computation of Average Sample Size Required
    Alternative Procedure for Calculating Average Sample Size
    Determination of Minimum Number of Items Required to Pass or Fail a Testee
    Summary

Appendix


INTRODUCTION 


One  of  the  chief  obstacles  to  the  more  widespread  use  of  performance 
testing  has  been  the  relatively  greater  time  and  expense  required  by 
performance  tests  as  compared  with  paper  and  pencil  tests.  Many 
performance tests are individually administered, and require the use
of a highly trained observer and a piece of expensive equipment for
one or two days in order to administer ten or twenty items to one
individual. During one or two hours, one test administrator can give perhaps
fifty to two hundred paper and pencil items to a large group of subjects
in an ordinary classroom. Obviously, on a time and expense basis,
paper and pencil tests are much more acceptable. That performance tests
have continued to be used at all is a testimony to strong feelings about
their usefulness.

Performance tests are ordinarily used for one or more of three
purposes: (1) to determine whether or not the person tested possesses
certain qualities to a desired degree (acceptance testing); (2) to
determine in what areas the person tested needs further training
(remedial testing); (3) to enable the person tested to perform certain
activities more skillfully (instructional testing).

The  first  of  these  purposes — the  use  of  performance  tests  for 
acceptance — is  most  widely  used.  Typical  situations  calling  for  this 
sort of testing are the determination of: (1) graduation from a
particular course or phase of a course; (2) acceptability for skilled
employment; and (3) advancement in rating. Anytime you desire to know
whether  a particular  applicant  passes  or  fails,  is  successful  or 
unsuccessful,  is  desirable  or  undesirable,  you  can  use  performance  tests 
as  the  basis  for  acceptance  or  rejection. 

The  second  and  third  uses  of  performance  testing,  remedial  and 
instructional,  are  seldom  used  outside  of  training  programs.  While  they 
are important, this discussion is not primarily concerned with them.

Regardless  of  the  purpose  or  purposes  for  which  they  have  been 
designed,  performance  tests  have  been  administered,  ordinarily,  as  a 
block.  That  is,  each  person  tested  is  given  the  same  number  of  performance 
test  items.  Paper  and  pencil  tests  also  have  been  administered  as  a rule 
in  this  same  manner,  with  every  person  taking  the  test  being  given  the 
same number of items. It is the purpose of this discussion to show that
this is not necessarily the most economical procedure, particularly for
performance acceptance testing. Sequential sampling is proposed as an
alternative. Since sequential sampling has been used most widely in
industrial acceptance of supplies, let us take a look at sequential
analysis  as  it  is  used  in  industry  at  the  present  time,  to  see  if  it  has 
application  to  performance  testing  of  people. 




Acceptance  Testing  of  Supplies 
in Industry and the Armed Forces


It is usually necessary for any agency purchasing commodities from
some other agency to determine the quality of the products which are
supplied to it. If you set out to buy a quantity of receivers from some
vendor, you establish specifications for these receivers, and then check
the receivers which are shipped to you to see whether or not they are
acceptable. Similarly, if one section of your organization produces
hydraulic fittings which will later be assembled into a gun mount, you
ordinarily  will  check  the  fittings  to  see  whether  they  are  acceptable 
before  shipping  them  to  final  assembly. 

One  method  of  determining  acceptability  of  products  is  to  give  them 
a one  hundred  per  cent  check.  This  obviously  is  impractical  if  the  test 
destroys  the  product  being  tested.  But  even  with  non-destructive  tests, 
most  industrial  concerns  have  adopted  sampling  procedures  for  determining 
the  acceptability  of  a lot  (group  of  products).  Usually  the  procedure 
has been to determine that a certain number or a certain percentage of a
lot would be checked, and the acceptability of the whole lot determined
from the sample. (This is basically the same procedure we use in testing
people. If we want to determine a person's grade in a course at
Annapolis, we pick a sample of the almost infinite number of questions
we could ask about the course, and estimate the percentage of questions
he could answer from the percentage of questions he did answer correctly
on the test.) Note that with this procedure, the number of items to be
tested is determined before testing is begun.

Since  World  War  II,  a method  of  sampling  called  "sequential  sampling" 
has been coming into wide use in quality control in industry and the armed
forces. Essentially, this method of sampling requires that a small sample
of the lot be tested. Then on the basis of this sample, one of three
decisions is reached: (1) accept the lot, (2) reject the lot, or
(3) continue testing. If it is decided to continue testing, another
sample is inspected, and on the basis of this information plus the
information from the preceding samples, one of the above three decisions
is made. Sampling is continued until the lot can be accepted or rejected.
This method requires much smaller samples for the lots that are extremely
good or extremely poor. For the few lots that are on the borderline
between acceptance and rejection, you may sample as many items as, or more
items than, are required when every lot is tested in exactly the same way.
Almost invariably, however, for a given degree of confidence in the results,
sequential sampling will require fewer tests than are required by
conventional sampling. (Wald, in his book, Sequential Analysis, estimates
the average saving at about 50 per cent.)

There appear to be good reasons why sequential sampling, which has
proved  so  successful  in  acceptance  sampling  in  industry  and  the  armed 
forces,  can  be  applied  with  success  in  acceptance  testing  of  people. 




Probable Limitations of Sequential Sampling
in Testing People

One feature of sequential sampling makes it difficult to employ with
certain types of tests when you are testing people instead of commodities.
Before one can proceed from the first to the second test item in a group of
test items, it is necessary to know the score a person has made on the
first item, or at least whether the person has passed or failed that item.
For group tests of the paper and pencil type, a person could be taking
several items during the time required for the first one to be scored.
Thus, the chief advantage of sequential analysis, a saving of time and
expense, is lost. For individual performance tests, however, the time
required to determine a person's rating on a test item is insignificant
in comparison with the time required for giving additional, possibly
unneeded items.

Sequential sampling is not as good for diagnostic or instructional
testing as are conventional tests, because for diagnosis and instruction
it is commonly desirable to expose the student to a wide range of items,
rather than to conclude the testing in as short a time as possible.
This caution does not necessarily apply to acceptance testing.

For best use with sequential sampling, the separate items on a test
need to be as nearly alike as possible; that is, there should be high
item intercorrelation. This is desirable in order to increase the
confidence you can place in the results of almost any test, but it is
particularly important with sequential analysis. In order to maximize
item intercorrelation, it appears most desirable to use sequential sampling
to determine whether a person passes or fails a phase of a course, since
here the items will be very much alike. A final examination for
determining whether a person passes or fails a long course would not be as
good, because items would cover a wide breadth of material, and would
be much less alike. The use of sequential sampling for determining whether
a person passed or failed a practical factor examination for advancement
in rating would probably be better than an examination over a long course,
but poorer than an examination over one phase of a course.


Advantages  of  Sequential  Sampling  in  Testing  People 

When the above limitations are recognized, and necessary precautions
are observed, sequential sampling offers one tremendous advantage in
testing people. Those persons who are extremely poor or extremely good
can be rejected or accepted after a relatively short period of testing.

Whether we recognize it or not, whenever we set up a test, we set
certain confidence limits in the results. Ordinarily, the greater the
length of the test, the greater confidence one can place in the results.
When you have decided what level of confidence you wish, by using sequential
sampling you can test extremely poor or extremely good persons with far
fewer items than are necessary for those people who are near the cutting
point in "true" ability. In fact, in certain situations, one or two items
are as reliable for those people at the extremes of ability as ten or
twenty items are for the person near the cutting point. Consequently,
when the cost of testing is high, as with performance tests requiring
one-half to two hours per test item, a marked saving can be made by
employing sequential sampling at no sacrifice of the overall reliability
standard.

In plain language, when you give the same number of items to people
of varied ability, you can place much more confidence in your test results
for those people at the extremes of ability than you can for those people
who are near the borderline between acceptance and rejection. When you
use sequential sampling, you employ fewer test items for those people at
the extremes than for those near the cutting point, and you place
approximately equal confidence in the results for each level of ability.


Summary 


Performance tests as ordinarily administered require a great deal of
time and expense. Sequential sampling, as adapted from industrial quality
control, offers good possibilities of reducing this time and expense for
acceptance testing. Sequential sampling involves taking a sample of
performance and then deciding whether to (1) accept the lot or person,
(2) reject the lot or person, or (3) continue testing.




II. SAMPLING IN TESTING


When the ability of people is being tested, it is often a useful
assumption that an infinite number of test items are available for testing
a certain trait. For example, if you want to test the proficiency of an
electronics technician on the practical factors involved in his rate, there
is an almost unlimited number of performance items you can give to him.
This number of items is so large that it can be considered to be nearly
infinite.

Naturally, in any practical test we cannot give all of the test items
which it is theoretically possible to give. Instead, we have to give a
test made up of a sample of those theoretically possible. There are a
number of reasons for this. Some items may involve too much expense,
some are too dangerous to personnel or equipment, some cannot be scored
consistently, etc. Even after we have eliminated all of the items which
are impractical to administer, another practical consideration forces us
to use only a sample of the remainder: only a certain amount of time
can be made available for testing. Thus it is safe to assume that any
test which is administered to people is a test which involves only a
relatively small sample of the possible items.

Now we are not really interested in a man's ability to answer or do
sample tasks. We want to know his true ability; that is, we want to know
how he would perform on the total number of possible items. However, we
cannot get at a man's "true" ability except by using the sample items as
a measure of his "true" ability.

Other  things  being  equal,  the  larger  the  sample  tested,  the  better 
the picture we get of the person's true ability. It is not at all uncommon
to find one hundred or more items included in one paper and pencil test.
So many items are used in an effort to boost the reliability of the test
and so to get a more accurate idea of the man's true ability.


In  most  performance  testing,  however,  it  is  impractical  to  give  more 
than  ten  or  twenty  test  items  because  of  the  time  required  per  item,  and 
because  the  test  must  usually  be  administered  on  an  individual,  rather  than 
a group basis. The chief reason performance tests continue to be used is
that people ordinarily feel that they measure very important aspects
of a person's job that cannot be tapped by paper and pencil tests.
Obviously, though, a performance test must involve only a sampling just
as any other test. And if this sample does not give a good picture of a
person's "true" performance, it is worthless. Thus we must make a
compromise between a very long performance test with many items in the
sample and a short performance test which will not interfere with other
needs of the testing agency.


Fortunately,  if  our  need  is  for  acceptance  testing,  the  problem  is 
not quite so acute. In acceptance testing we do not need to get a complete
picture of each man's "true" ability. We need only to determine whether
he is above or below some standard. That is, does he or does he not make
a passing mark? However, we do need to make sure, with a reasonable degree
of confidence, that those persons whose true ability is above the cutting
point are passed, and those whose true ability is below the cutting point
are failed. Consequently we must still be concerned with test reliability.

For  many  purposes  it  is  more  convenient  to  think  of  a concept 
labelled  "probability  of  acceptance"  than  to  think  of  "reliability"  when 
we are discussing acceptance testing. Probability of acceptance can be
abbreviated Pa. One way of looking at probability of acceptance is to use
the so-called "operating characteristics curve."


The  Operating  Characteristics  Curve 

Ideally,  when  we  set  a "cutting  score"  (sometimes  called  a "passing 
mark"  or  "borderline  between  acceptance  and  rejection"),  we  want  an 
operating  characteristics  curve  like  that  shown  in  Figure  1. 

This  ideal  curve  would  have  a vertical  line  immediately  above  the 
borderline  between  acceptance  and  rejection.  If  it  were  possible  to  get 
such an ideal curve, and the "cutting score" or "passing mark" were set
at  70,  one  hundred  per  cent  of  those  who  could  make  a score  of  70  or 
better  on  an  infinite  number  of  such  items  would  be  passed  (accepted), 
while  all  those  who  could  make  a score  of  less  than  70  on  an  infinite 
number  of  such  items  would  be  failed  (rejected).  Note  that  the  horizontal 
axis refers to the "true" score an individual would make on an infinite
number of items, not to the score he would make on a practical test.

Figure 1 is a theoretical curve probably never encountered in practice.
Yet it is the sort of curve we want to strive for in acceptance testing.
(Incidentally, a perfectly reliable acceptance test would show a curve
like that in Figure 1. And we should remember that reliability places
an upper limit on validity.)


FIGURE 1

Ideal Operating Characteristics Curve for
Acceptance Testing (Ideal OC Curve)

Vertical axis = Percentage of students passed (0 to 100)
Horizontal axis = Score on an infinite number of items similar to those
given on the test (passing mark is 70)




Perhaps it will help us to get a better picture of the operating
characteristics curve if we look at some OC curves for conventional tests
of a length commonly used in performance testing. Figures 2, 3, and 4
show a series of operating characteristic curves for five-item tests
with passing marks of 60%, 80% and 100% respectively. Each person tested
has been given all five of the items. Each of the items on these tests
has been scored as either pass or fail. Note that in Figure 2, if you
had a group of people whose "true" score was 70%, only 80% of these
people would be accepted (pass the test). Moreover, if you had a group
of people whose "true" score was only [illegible], approximately 40% of them
would be passed, even though their true score was far below the cutting
score set for the test. Figures 5, 6, and 7 present similar information
for a series of twenty-item tests with passing marks of 60%, 80%, and
100%. Each person tested has been given all of the twenty items, and
each of the items has been scored on a pass-fail basis. Here the picture
is somewhat better, but it is still a long way from the ideal curve
shown in Figure 1.


FIGURES 2, 3, and 4

Poisson Approximation of Operating Characteristic Curves
for Fixed Length (Conventional) Tests (adapted from
Grant, Statistical Quality Control, page 323 and Table G)

Vertical axis = % of students passed
Horizontal axis = True score, in % of items correct

Fig. 2 - Five items in test. Passing grade, 60%
Fig. 3 - Five items in test. Passing grade, 80%
Fig. 4 - Five items in test. Passing grade, 100%


FIGURES 5, 6, and 7

Vertical axis = % of students passed
Horizontal axis = True score, in % of items correct

Fig. 5 - Twenty items in test. Passing grade, 60%
Fig. 6 - Twenty items in test. Passing grade, 80%
Fig. 7 - Twenty items in test. Passing grade, 100%


Other things being equal, the larger the number of items on the
test, the closer the OC curve will approach Figure 1. The reason for
this is that with a small number of items, there is a good chance that
several of the items on the test happen to be among the few that a person
of poor ability knows. Conversely, if there is a small number of items
there is a good chance that several of the items on the test happen to
be among the few that a person of great ability happens not to know.
With a large number of items, this chance factor becomes less important.
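
The effect of test length on the operating characteristics curve can be sketched numerically. The block below is an illustration only, not the method used to draw Figures 2 through 7: it substitutes the exact binomial formula for the Poisson approximation, and it assumes pass-fail items that a person passes independently, each with a chance equal to his "true" score. The function name `prob_pass` and the true scores printed are hypothetical choices made for the example.

```python
from math import comb, ceil

def prob_pass(n_items, passing_fraction, true_score):
    """Chance of passing a fixed-length pass-fail test (exact binomial).

    Assumption: each item is passed independently with probability
    `true_score`, given as a fraction between 0 and 1.
    """
    # Smallest number of passed items that meets the passing mark.
    # round() guards against floating-point noise in the product.
    needed = ceil(round(passing_fraction * n_items, 9))
    return sum(
        comb(n_items, k) * true_score**k * (1 - true_score) ** (n_items - k)
        for k in range(needed, n_items + 1)
    )

if __name__ == "__main__":
    # Compare a five-item and a twenty-item test, both with a 60% passing mark.
    for true_score in (0.9, 0.7, 0.6, 0.5, 0.3):
        five = prob_pass(5, 0.6, true_score)
        twenty = prob_pass(20, 0.6, true_score)
        print(f"true score {true_score:.0%}: "
              f"5 items {five:.2f}, 20 items {twenty:.2f}")
```

Under these assumptions the twenty-item test passes more of the people well above the passing mark and fewer of those well below it than the five-item test does, which is the sense in which the longer test lies closer to Figure 1.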

The  concept  of  the  operating  characteristics  curve  is  one  of  the 
most important in acceptance testing. Without it, one is apt to fall
into the common error of accepting test results as being necessarily
a true  picture  of  a man's  ability. 


Summary 


Any  test  should  be  regarded  as  a sampling  of  a great  number  of 
possible test items, and consequently test results are not necessarily
a good picture of a man's true ability. The operating characteristics
curve helps to show this discrepancy in acceptance testing. Other things
being  equal,  the  longer  the  test,  the  more  reliable  the  results. 
Performance  tests,  however,  are  by  nature  limited  in  length,  and  some 
compromise  must  be  reached  with  reliability. 




III. CHOOSING A SEQUENTIAL SAMPLING PLAN


All  sequential  sampling  plans  which  are  now  used  in  industry  are 
alike  in  that  they  can  be  represented  by  the  general  type  of  chart  shown 
in Figure 8. As each item is given, the person's cumulative score is
plotted above the item number. Testing continues until the graph runs
outside the two parallel lines into the "accept" area or the "reject" area.
That ends the test. In the example shown in Figure 8, a small score was
made on item number one; approximately the same score was earned on item
two, but on the third item a score of zero was received, and the test
ended with the rejection of the person tested. The slope and origins
of the two lines determine how rapidly this will occur. (Because of the
fact that most performance test items differ in difficulty and in
discriminative value, a modification of Figure 8 is proposed for use in
testing people. This modification is described in the next chapter.)


FIGURE 8

Graphic Presentation of a Sequential Sampling Plan

Horizontal axis = Number of Items Tested
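
The chart procedure described above can be sketched in a few lines. This is an illustration only: the slope and the two intercepts are hypothetical numbers chosen so that the example ends on the third item, and the names `run_chart`, `slope`, `accept_intercept`, and `reject_intercept` do not come from the original.

```python
def run_chart(item_scores, slope, accept_intercept, reject_intercept):
    """Plot-free version of the sequential chart.

    After item n the cumulative score is compared with two parallel lines:
        accept line: accept_intercept + slope * n
        reject line: reject_intercept + slope * n
    Returns (decision, number_of_items_given).
    """
    cumulative = 0.0
    for n, score in enumerate(item_scores, start=1):
        cumulative += score
        if cumulative >= accept_intercept + slope * n:
            return "accept", n
        if cumulative <= reject_intercept + slope * n:
            return "reject", n
    # Items ran out while the graph was still between the two lines.
    return "continue", len(item_scores)

if __name__ == "__main__":
    # Hypothetical plan: items scored 0 to 10, both lines rising 6 points per item.
    # Small scores on items one and two, then a zero on item three.
    print(run_chart([3, 3, 0], slope=6, accept_intercept=10, reject_intercept=-10))
```

A steeper slope or lines set closer together would end the test sooner; lines set farther apart demand more items before either decision is reached.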




Information Needed for Choosing a Sequential Sampling Plan

Two sorts of information are needed before a sequential sampling
plan can be chosen for testing people: (1) what constitutes an acceptable
and an unacceptable person; and (2) what risks you are willing to take
of accepting a "poor" person and of rejecting a "good" person.


Acceptable  and  Unacceptable  Persons  * 

Ordinarily  when  a test  is  given  for  purposes  of  accepting  or  rejecting 
individuals  in  a group,  some  sort  of  passing  score  is  used.  All  of  those 
who  score  above  this  cutting  point  are  passed,  and  all  of  those  below  are 
failed.  However,  when  sequential  analysis  is  used,  two  points  are  used, 
rather  than  a single  cutting  point. 

One of these points may be described as the lower limit, or lowest
score characteristic of the really good people. This point may be
designated as mg.

The other point may be described as the higher limit or highest
score characteristic of the really poor people. This point may be
designated as mp. For example, the lower limit of really good people
may be set as a score of 70 on a particular test, while the upper limit
of really poor people is set as 50. In this case mg = 70, and mp = 50.
The scores between mg and mp form the "zone of indifference" and are
characteristic of people who are neither really good nor really poor.

This determination of the lower limit for really acceptable people,
and the upper limit for really unacceptable people is the first decision
that must be made in choosing a particular sequential sampling plan.


Probability  of  Acceptance 

Any sampling plan, whether it is a traditional test, a common
performance test, an industrial inspection scheme, or a sequential test,
involves certain risks. These risks are primarily due to sampling errors
as discussed in the previous chapter, plus errors due to the instability
of a man's performance from time to time. The second decision that must be
made in setting up a sequential sampling plan involves the determination of
the risk that you are willing to take for each of the two points discussed
in the preceding paragraphs.


* The assumption is made throughout this discussion that all people
at one end of a scale of ability are acceptable and all those at the other
end of the same scale are unacceptable. This is the usual case in
performance testing. If, however, you wished to accept only those people who
were not too high or not too low on a scale (such as finger dexterity),
much of the following discussion would not apply.




Naturally you want to be certain of accepting practically all of the
"good" people, and certain of rejecting practically all of the "poor"
people. However, the more certain you are, the greater the number of
test items required. In the example above, if you are ready to take
the risk of rejecting 5 out of 100 people whose true score (mg) equals
70, the probability of acceptance of mg would equal .95. In other words
you would be willing to buy a plan which in the long run would guarantee
your acceptance of 95% of the people whose true score was 70. (This
would be abbreviated Pag = .95, mg = 70, which may be interpreted as:
the probability of acceptance equals .95 when the true score equals 70.)

Similarly, if you are willing to take the risk of accepting 20 out
of 100 people whose true score (mp) equals 50, the probability of acceptance
of mp would equal .20. In other words, you would be willing to buy a
plan which in the long run would guarantee your acceptance of only 20%
of the people whose true score was 50. (This would be abbreviated
Pap = .20, mp = 50.)

The  closer  Pap  is  to  zero,  or  the  closer  Pag  is  to  one,  the  more 
test  items  you  will  need  to  administer.  That  is,  the  more  certain  you 
want  to  be  in  your  judgments,  the  more  it  will  cost  you. 

Note that in the example above, Pag was closer to one than Pap was
to zero. We wanted to be more certain of getting all of the "good"
persons than we wanted to be certain of rejecting all of the "poor"
persons. This is the usual situation when many men are needed,
particularly when there are going to be other opportunities later on of weeding
out the poor people who were inadvertently accepted. However, there is
nothing to prevent the risk of rejecting mg from equalling the risk of
accepting mp (for example, Pag = .90, and Pap = .10). Or when there is
an oversupply of men, or when the acceptance of a poor man may mean serious
consequences such as the failure of a mission, the risk of accepting poor
men may be set lower than the risk of rejecting good men (for example,
Pag = .80, and Pap = .001). During World War II the Office of Strategic
Services had many men to choose from and vital missions to perform, so
they were willing to take the chance of rejecting many good men, provided
that they could be reasonably certain of getting very few poor ones.

The choice of mg, mp, Pag, and Pap determines the operating
characteristics curve of the sequential sampling plan.
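
As a rough numerical sketch of how these four choices fix a plan, the block below applies Wald's standard approximations for the sequential probability ratio test to pass-fail items. It is an illustration under stated assumptions rather than the procedure of this report: scores of 70 and 50 are read as a 70% and a 50% chance of passing any one item, items are treated as independent, and all function names are invented for the example.

```python
from math import log, ceil

def wald_limits(pag, pap):
    """Wald's approximate limits for the likelihood ratio (good over poor).

    pag: probability of accepting a person whose true score is mg
    pap: probability of accepting a person whose true score is mp
    """
    upper = pag / pap                # accept when the ratio reaches this
    lower = (1 - pag) / (1 - pap)    # reject when the ratio falls to this
    return upper, lower

def shortest_runs(mg, mp, pag, pap):
    """Fewest items needed to accept (all passed) or reject (all failed).

    Assumption: mg and mp are per-item pass probabilities (0 to 1).
    """
    upper, lower = wald_limits(pag, pap)
    to_accept = ceil(log(upper) / log(mg / mp))
    to_reject = ceil(log(lower) / log((1 - mg) / (1 - mp)))
    return to_accept, to_reject

def average_items_at_mg(mg, mp, pag, pap):
    """Wald's approximate average number of items for a person at mg."""
    upper, lower = wald_limits(pag, pap)
    # Expected change in the log likelihood ratio from one item when p = mg.
    step = mg * log(mg / mp) + (1 - mg) * log((1 - mg) / (1 - mp))
    return (pag * log(upper) + (1 - pag) * log(lower)) / step

if __name__ == "__main__":
    print(wald_limits(0.95, 0.20))
    print(shortest_runs(0.70, 0.50, 0.95, 0.20))
    print(round(average_items_at_mg(0.70, 0.50, 0.95, 0.20), 1))
    print(round(average_items_at_mg(0.70, 0.50, 0.99, 0.01), 1))
```

With the example values (mg = 70, mp = 50, Pag = .95, Pap = .20) and these assumptions, five straight passes are enough to accept and six straight failures are enough to reject. Tightening the risks to Pag = .99 and Pap = .01 raises the approximate average number of items for a person at mg severalfold, which illustrates the point that greater certainty costs more items.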


Means  of  Reporting  Scores 

Performance test data are usually available in a variety of forms.
With a simple method of scoring, results may be reported as "pass" or
"fail". More precise scores are usually expressed numerically, with as
many as one hundred or more different grades possible for one test item.
Sometimes letter grades are used in reporting scores.




The method of reporting scores, whether numerical, letter, or word
grades, is relatively unimportant. It is important, however, to consider
the number of scores which students actually can make on a test item.
Other things being equal, it is possible to learn much more about a
student's performance if he can make any one of five possible scores
on an item, than if his performance is reported as either "pass" or
"fail". As one would expect, for a given mg, mp, Pag, and Pap, it
ordinarily requires far fewer test items to determine acceptance when
five or ten scores are given on each item than when only two scores
are available.

There is ordinarily a practical limit to the number of test scores
which should be attainable on any one item. If more than about ten scores
are reported, the calculations necessary for sequential analysis become
rather laborious. If, for example, time in seconds required to perform
some task were used as a grade, those scores which students actually
attain can be grouped to bring the total number of scores within a
reasonable limit.

As is usually the case, however, other things are not always equal.
Sometimes you are much surer of your judgments in evaluating a performance
item if you report it as passed or failed, rather than in terms of a
score. Many times it is much quicker to score an item as passed or failed.
A common example is in the use of objective and essay type paper and pencil
examinations. Objective questions are almost invariably scored as either
right or wrong. Essay questions could be graded with a score, or as
pass-fail, but are usually given a score. Yet objective questions are
widely used because they are easier to score and because they can be
administered more rapidly. It is recommended as a general rule that
five or more scores per item be used in sequential analysis whenever
practical, and that when more than ten scores are used, scores be grouped
to provide somewhere between five and ten intervals.
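
The grouping recommended above can be sketched in a few lines of Python.
This is only an illustration: the function name score_group, the 1-100
score range, and the choice of equal-width groups are assumptions made
for the sketch, not part of the procedure described in this report.

```python
def score_group(raw, low, high, n_groups):
    """Return the 0-based index of the equal-width group containing `raw`.

    Assumes integer raw scores running from `low` to `high` inclusive.
    Equal-width grouping is an assumption of this sketch.
    """
    if not low <= raw <= high:
        raise ValueError("raw score outside the stated range")
    span = high - low + 1              # number of distinct raw scores
    return (raw - low) * n_groups // span


# Hypothetical item scored 1-100 and reduced to five groups
# (1-20, 21-40, 41-60, 61-80, 81-100).
groups = [score_group(s, 1, 100, 5) for s in (1, 20, 21, 60, 100)]
print(groups)
```

Scores of 1 and 20 land in the first group, 21 in the second, and 100 in
the fifth. In practice the group boundaries would be chosen from the
scores students actually attain rather than from equal widths.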


Item  Intercorrelation  and  Difficulty 

The original use of sequential sampling was in acceptance inspection
for the armed forces. In this type of sampling the problem is to determine
whether a lot is acceptable by testing a sample from that lot. This
involves making the same test on each of the products in the lot. When
you determine a person's acceptability as a trouble shooter on radar
gear, the problem is to determine his acceptability by testing him on a
sample of radar trouble shooting problems. The total range of that
person's trouble shooting ability is comparable to the industrial lot;
the group of trouble shooting items you give to that person is comparable
to the inspection sample drawn from the industrial lot; and one test
item for that person is comparable to a test of one piece from the
industrial inspection sample.




This analogy breaks down somewhat on the last comparison. Each
piece in the industrial inspection sample is given exactly the same test.
In most cases we cannot give one person a series of test items exactly
alike, for if he knew the answer to one of them, he would know the answer
to all. We would really be giving him only one test item a number of
times. (This would be all right if we were testing a basketball player's
ability to shoot free throws, but where any sort of problem solving or
progressive learning is important in the test, it is impractical to
repeat identical test items.)

In practice, if we want to measure trouble shooting ability of a
certain type, we prepare several different trouble shooting test items
which are very similar. All of them would involve trouble shooting
on a particular type of radar gear, for example. If, through this
procedure, you obtain relatively high item intercorrelation, a large
part of your problem is solved. However, it is also necessary that item
intercorrelations be insignificant when the criterion score (internal
or external) is held constant.

One further step is necessary in order to insure equal difficulty
level. In the industrial inspection situation, each test on each
commodity in the sample is of equal difficulty, since each test is the
same. This is not true of the usual type of item which must be chosen
for testing people. Even though you obtain high item intercorrelation,
some test items will be much easier than others. It is particularly
important that this be taken into account in sequential sampling, for
if several very easy items were chosen by chance to be the first items
administered, and no adjustment were made in scoring, practically
everyone tested would be accepted right away.

One  possible  solution  for  the  problem  of  item  difficulty  is  to 
convert  test  scores  into  some  standard  score  such  as  the  "T"  score. 

An even simpler method of compensating for item difficulty empirically
is described in the next chapter.


The Average Sample Size Curve

In sequential analysis, unlike most methods of testing, there is
no way to know in advance how many test items are required for determining
the acceptability of any one person.* It is possible, however, to
determine the average sample required. The curve showing this information
is known as an "Average Sample Size Curve".


* In practice a decision is usually made to stop testing after a
certain number of items, and to declare the person tested to be accepted
(or alternatively, to stop testing after a certain number of items, and
to declare the person tested to be rejected).




The Average Sample Size curve has a characteristic shape. (See
Figure 9.) The largest number of items is required in testing persons
who are neither "good" nor "poor" (those who are between mg and mp in
ability). The height of the curve is determined by Pag and Pap, the
probability of accepting "good" and "poor" persons, respectively. Its
position over the base line is determined by mg and mp. The procedure
for computing the average number of test items required for a particular
sequential sampling plan is presented in the next chapter.


FIGURE 9

Typical Shape of Average Sample Size Curves


Summary


The choice of a sequential sampling plan is determined by decisions
as to (1) what constitutes an acceptable and an unacceptable person; and
(2) what risks can be taken of accepting a "poor" person and of rejecting
a "good" person. Ordinarily, a person should be able to make one of five
or more scores on any one test item. The number of items required for
testing a particular person can never be known in advance, but the
average number of items required can be determined.




IV. TENTATIVE SUGGESTIONS FOR
PUTTING A SEQUENTIAL SAMPLING PLAN INTO OPERATION *


It is assumed that a performance test has been constructed according
to accepted principles such as those outlined in standard reference works
in this field.1 A few more test items should be planned than will be
needed in the final form of the test.

It is further assumed that tentative decisions have been reached as
to Pag and Pap, the probability of acceptance of good people, and the
probability of acceptance of poor people. These decisions should be made
in accordance with the principles described in the preceding chapter.


Tentative Standardization of Test Items


Arrangements should be made to administer all items of the test to a
group of people in order to determine item difficulty and item
discrimination. The people chosen for this purpose should be a random
selection from a population similar to those who will later take the test.
If, for example, you plan to use the performance test at the end of the
radar phase of the Class A school for electronics technicians, your
standardization group should be a random selection of students in this
course who have just finished the radar phase.

Each performance test item should be administered to each member of
the standardization group. (It is desirable to administer all of the
even numbered items, followed by all of the odd numbered items, to half
of the standardization group. The other half of the group would take
the odd numbered items first, followed by the even numbered items. A
test of significance of difference of mean scores should be computed


* The rationale for the empirical determination of item discrimination
and item difficulties described in this chapter was developed
independently by Dr. Lee J. Cronbach of the University of Illinois and
by Dr. Jacob Wolfowitz of Columbia University. The mathematical formulae
used here are those developed by Wolfowitz, based on the work of
Dr. Abraham Wald. However, the description of the processes used is
the responsibility of the present author, and any errors should be
ascribed to him alone.

1. Adkins, Dorothy C., Construction and Analysis of Achievement Tests,
1947, Superintendent of Documents, Washington, D.C.

Micheels, W. J. and Karnes, M. R., Measuring Educational Achievement,
1950, McGraw-Hill, New York.

U. S. Navy, Constructing and Using Achievement Tests, NAVPERS 16608,
194[?], Bureau of Naval Personnel, Washington, D.C.




in order to determine whether it is tenable to assume that there is no
appreciable amount of progressive learning occurring during the test.

a. If this hypothesis is untenable, the standardization scores
should be based on the performance of only one-half of the
standardization group, and in the future, test items should
be administered in exactly the same order they were taken by
the standardization group.

b. If this hypothesis is tenable, the magnitude of the difference
in mean scores should be inspected. If the difference in
mean scores is relatively large, and the N is small, you may
wish to consider the null hypothesis untenable, even though
this has not been demonstrated statistically, and hence
administer the test items in standard order. If the N is
reasonably large and the difference in mean scores is relatively
small, standardization scores should be based on the
performance of the entire standardization group, and in the future,
test items can be administered in any order.)

After the proposed performance items have been administered to the
standardization group, the next step is to decide who are the "good" men
and who are the "poor" men.

a. Total each man's raw score.

b. Arrange total scores in numerical order, with desirable scores
first. Presumably, if the test is valid, the "good" men will
be at the top, and "poor" men at the bottom.

c. Determine the score which separates the "good" men from those
who are mediocre. This score is designated mg. Determine the
score which separates the really "poor" men from those who are
mediocre. This score is designated mp. (In most school
situations, "good" men will be those who would score "A", "B", or
"C", and really "poor" men would be those who would be failed.
Mediocre men would be those who would receive a grade of "D".)

d. Put the names of the good men in one list, and names of the
really poor men in a second list.
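
Steps "a" through "d" above can be sketched as follows. The function
name, the men's names, and their totals are hypothetical, and the sketch
assumes that high scores are desirable, that a total above mg counts as
"good", and that a total below mp counts as "poor".

```python
def split_standardization_group(totals, mg, mp):
    """Sort men by total raw score and split them into three lists.

    totals: dict mapping a man's name to his total raw score.
    Assumes high scores are desirable. Treating "above mg" as good and
    "below mp" as poor (strict inequalities) is an assumption of this
    sketch.
    """
    ranked = sorted(totals, key=totals.get, reverse=True)
    good = [name for name in ranked if totals[name] > mg]
    poor = [name for name in ranked if totals[name] < mp]
    mediocre = [name for name in ranked if mp <= totals[name] <= mg]
    return good, mediocre, poor


# Hypothetical totals for six men, with mg = 58 and mp = 37.
totals = {"Adams": 98, "Baker": 61, "Clark": 55,
          "Davis": 40, "Evans": 36, "Frank": 24}
good, mediocre, poor = split_standardization_group(totals, mg=58, mp=37)
print(good, mediocre, poor)
```

Only the good and poor lists are used in computing discrimination scores;
the mediocre men are set aside.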

The last step in the standardization process is the determination
of the discrimination scores for each item. The discrimination score may
be abbreviated Ds, and is determined by dividing the proportion of poor
people making a certain score on one item by the proportion of good people
making the same score on that item.

a. Group the raw scores that it is possible to make on item number
one, so that you have between five and ten groups.

b. Tabulate the number of good people who fall into each score
group on item number one. Determine the proportion of good
people in each group.

c. Tabulate the number of poor people who fall into each score
group on item number one. Determine the proportion of poor
people in each group.

d. For each raw score group on item number one, you should have
two proportions. Divide the proportion found in "c" above by
the proportion found in "b" above. (That is, divide the
proportion of poor people in a certain score group on an item by
the proportion of good people in that same score group on that
item.) The quotient is the discrimination score. This process
converts each raw score into a discrimination score (Ds). These
discrimination scores can range in value from zero to infinity.

e. Repeat steps "a" through "d" for each test item.

f. Check item number one to make sure that the discrimination scores
form a sequence from high to low, with no reversals in value, and
no Ds values of infinity. If there are reversals or values of
infinity, employ curve smoothing to eliminate them. Curve
smoothing may be done by plotting the Ds values and drawing a
smooth curve by inspection, or by

(1) averaging the proportions of good people in each set of
three adjacent cells;

(2) averaging the proportions of poor people in each set of
three adjacent cells;

(3) computing the Ds values from these averages.

Consider the following example:


                        Item Number One

  Raw        Good People           Poor People
 Score    Number  Proportion    Number  Proportion       Ds

  1-20       1       .05          10       .67         13.40
 21-40       0       .00           1       .07       infinity
 41-60       4       .20           3       .20          1.00
 61-80       8       .40           1       .07           .18
 81-100      7       .35           0       .00           .00

 total      20      1.00          15      1.00


There is a reversal between the Ds corresponding to raw scores
of 1-20 and raw scores of 41-60, since the Ds values are not
in numerical order. Moreover, there is one Ds value of infinity.
We can smooth these figures by recomputing the proportion of
good people who made raw scores of 21-40 by averaging the
original proportion, .00, with the two proportions on each side
of it (.05, corresponding to a score of 1-20; and .20,
corresponding to a score of 41-60). This will yield an average
proportion of .08. Repeat this for each of the proportions
for both good and poor people, using three adjacent proportions
for each average. (For the highest and lowest scores, there
are no data available for the third proportion. In this case,
it is usually best to assume that the unknown proportion is
the same as the last known proportion. Thus the smoothed
proportion of good people corresponding to a raw score of
81-100 would be the average of .40, .35, and .35.)




If this procedure is followed, the example would appear like this:

                 Item Number One

  Raw      Good People   Poor People
 Score     Proportion    Proportion       Ds

  1-20        .03           .47          15.7
 21-40        .08           .31           3.88
 41-60        .20           .11            .55
 61-80        .32           .09            .28
 81-100       .37           .02            .05

 total       1.00          1.01


Occasionally, it may be desirable to go one step further, and
employ curve smoothing on the Ds values.

Reversals of the type described above are caused by too small a
standardization sample, or by items which are unreliable.
Ideally, items which show reversals should be discarded, but
in view of the small samples normally available for initial
standardization, curve smoothing will usually give interpretable
results. However, if the poor men make better scores than the
good men, the item should be discarded, at least until more
data can be obtained.

g. Repeat step "f" for each test item.

Discrimination scores obtained on the items which are retained
after the original standardization process should not be
regarded as fixed values, but should be corrected as additional
data become available during the use of the test.
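
The computation and smoothing of discrimination scores described above
can be sketched in Python. The function names are hypothetical, and the
counts are those of the Item Number One example. Because the sketch
divides unrounded proportions, its Ds values differ slightly from values
obtained by hand from proportions rounded to two places.

```python
import math


def proportions(counts):
    """Convert the number of men in each score group to proportions."""
    total = sum(counts)
    return [c / total for c in counts]


def smooth(props):
    """Average each proportion with its two neighbours.

    At either end the missing neighbour is taken to equal the end value,
    as suggested in the text.
    """
    padded = [props[0]] + list(props) + [props[-1]]
    return [sum(padded[i:i + 3]) / 3 for i in range(len(props))]


def discrimination_scores(good_props, poor_props):
    """Ds = proportion of poor people / proportion of good people."""
    scores = []
    for g, p in zip(good_props, poor_props):
        if g == 0:
            # Division by zero: infinite Ds, or undefined if both are zero.
            scores.append(math.inf if p > 0 else math.nan)
        else:
            scores.append(p / g)
    return scores


good_counts = [1, 0, 4, 8, 7]     # raw-score groups 1-20 ... 81-100
poor_counts = [10, 1, 3, 1, 0]

raw_ds = discrimination_scores(proportions(good_counts),
                               proportions(poor_counts))
smooth_good = smooth(proportions(good_counts))
smooth_poor = smooth(proportions(poor_counts))
smooth_ds = discrimination_scores(smooth_good, smooth_poor)

print([round(d, 2) for d in raw_ds])
print([round(d, 2) for d in smooth_ds])
```

Before smoothing, the second group has a Ds of infinity; after smoothing,
the Ds values fall steadily from the lowest raw-score group to the
highest, which is the check called for in step "f".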




Computation  of  A and  B 

In the previous chapter, considerable attention was paid to Pag and
Pap, the probability of acceptance of good and poor men, respectively.
These values are used in computing A and B, which are the limits for
discrimination scores, and determine when a person is accepted, rejected,
or when additional testing needs to be done. These relationships are
rather simple:

$$B = \frac{Pa_p}{Pa_g} = \text{point of acceptance}$$

$$A = \frac{1 - Pa_p}{1 - Pa_g} = \text{point of rejection}$$




For example, if the probability of acceptance of good men were set
at .95, and the probability of acceptance of poor men were set at .20,
A and B could be determined as follows:

$$B = \frac{Pa_p}{Pa_g} = \frac{.20}{.95} = .21$$

$$A = \frac{1 - Pa_p}{1 - Pa_g} = \frac{1 - .20}{1 - .95} = \frac{.80}{.05} = 16.0$$

Values of A and B for a variety of common values of Pag and Pap are
shown in Table 1.
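
For values of Pag and Pap not listed in a table, A and B can be computed
directly. The following is a minimal sketch; the function name
decision_limits and its argument names are hypothetical.

```python
def decision_limits(pa_good, pa_poor):
    """Return (A, B) from the probabilities of accepting good and poor men.

    A = (1 - Pap) / (1 - Pag) is the point of rejection.
    B = Pap / Pag is the point of acceptance.
    """
    if not 0 < pa_poor < pa_good < 1:
        raise ValueError("expected 0 < Pap < Pag < 1")
    a = (1 - pa_poor) / (1 - pa_good)
    b = pa_poor / pa_good
    return a, b


a, b = decision_limits(pa_good=0.95, pa_poor=0.20)
print(round(a, 1), round(b, 3))
```

With Pag = .95 and Pap = .20 this reproduces the values worked out above.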


Scoring Performance Items
During Routine Test Administration


As each man completes a performance test item, his raw score is
determined, and then converted to a discrimination score, using a conversion
table based on the standardization process described above. His
discrimination score is then compared with the values determined for A
and B. This will result in one of three actions:

1. If the man's discrimination score is equal to or greater than A,
he is immediately rejected (flunked), and takes no more test items.

2. If the man's discrimination score is equal to or smaller than B,
he is immediately accepted (passed), and takes no more test items.

3. If the man's discrimination score is between A and B, he proceeds
to the second test item.

Suppose that for a particular man, action 3 is indicated. After he
has completed test item number two, his raw score on this item is
determined, and converted to a discrimination score. Since a man's score
in a sequential test is based on all of the items he has taken previously
during the test, we multiply the discrimination score he made on the
second item by the discrimination score he made on the first item, and
compare the result with A and B. This will again result in one of the
three actions outlined above.

Suppose that after the second test item, action 3 is indicated again.
Test item number three is administered, a discrimination score determined,
and multiplied by the product of all previous discrimination scores (Ds
for item one × Ds for item two × Ds for item three) and compared with A
and B.

This process continues until the man is either accepted, or rejected,
or until no more test items are available. If no more test items are
available, the man is declared to be accepted, if there is a critical need
for men; or rejected, if there is not a critical need for men.
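
The scoring rule just described can be sketched as a short Python
function. The name sequential_decision, the critical_need flag, and the
Ds values in the example are hypothetical and serve only to illustrate
the three actions.

```python
def sequential_decision(ds_values, a, b, critical_need=False):
    """Return (decision, items_used) for one man.

    ds_values: discrimination scores in the order the items were taken.
    a, b:      points of rejection and acceptance.
    """
    product = 1.0
    for n, ds in enumerate(ds_values, start=1):
        product *= ds                  # running product of all Ds so far
        if product >= a:
            return "reject", n         # action 1
        if product <= b:
            return "accept", n         # action 2
        # otherwise action 3: go on to the next item
    # No more items: the decision depends on the need for men.
    return ("accept" if critical_need else "reject"), len(ds_values)


# Hypothetical Ds values for one man, with A = 16.0 and B = .21.
result = sequential_decision([1.4, 0.5, 0.6, 0.4], a=16.0, b=0.21)
print(result)
```

Here the running products are 1.4, .70, .42, and .168, so the man is
accepted on the fourth item.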




Example of Procedure
in Sequential Analysis


Ten performance test items on trouble-shooting the oUlb radar were
prepared and administered to fifty-seven students who had just completed
the radar section of a Navy Class A electronics technicians school. The
following total scores were obtained:

 148     96     74     62     56     36
 132     93     74     62     56     34
 130     90     74     61     55     34
 123     89     73     60     54     30
 115     88     72     59     54     29
 114     88     70     59     51     27
 110     87     69     59     [?]    25
 104     84     66     [?]    [?]    24
 102     80     66     56     [?]    [?]
  98     76     65     56     36


High scores indicated good performance.

It was decided arbitrarily that all men who scored above 58 were
definitely good men, and that all those who scored below 37 were definitely
poor men who needed additional training.

Discrimination scores were determined for item number one as follows:


                        Item Number One

  Raw        Good People           Poor People
 Score    Number  Proportion    Number  Proportion      Ds

   0        11       .30           6       .67         2.23
   3         6       .16           3       .33         2.06
   6         2       .05           0       .00         0
  12         2       .05           0       .00         0
  16         5       .14           0       .00         0
  24         6       .16           0       .00         0
  30         5       .14           0       .00         0

 total      37      1.00           9      1.00




Similar procedures were followed for the remaining nine items, and
the following discrimination scores were obtained:


  Item No. Two        Item No. Three       Item No. Four

   Raw                  Raw                  Raw
  Score     Ds         Score     Ds         Score     Ds

   [?]     1.57         [?]     2.20          0      4.40
   1-6     1.37         1-6     2.05          2      1.57
  [?]-10    .29         8-10     .19         4-6      .94
                                             [?]      .92
                                             10       0


  Item No. Five       Item No. Six         Item No. Seven

   Raw                  Raw                  Raw
  Score     Ds         Score     Ds         Score     Ds

   0-3     2.17         [?]     2.97         [?]     2.75
   5-12     .79         4-8      .58         4-8     1.38
  18-30     0          12-20     0          12-20     .37


  Item No. Eight      Item No. Nine        Item No. Ten

   Raw                  Raw                  Raw
  Score     Ds         Score     Ds         Score     Ds

    0      2.33         0-2     2.48         0-3     1.59
   2-[?]   1.30          4      2.20          6      1.38
    8      1.00         8-12     .69        12-18    1.00
  12-20     0          16-20     0          24-30     0


Since there were only nine "poor" people in the sample, these
discrimination values are regarded as only tentative. They should be
corrected as additional data are available.


Application  of  Ds  Values 


If it were decided to use these particular performance test items
in some future testing program, the procedure would be as follows:

a. Determine Pag and Pap. Suppose that since the Navy needed all
of the good ET's it could get, a decision was made to risk
failing only 5 per cent of the good men. Thus the probability
of acceptance for good men (Pag) would be .95. Since there
would be further opportunities for screening men at later dates,
it might be decided to take a risk of accepting 20 per cent of
the poor men. Thus the probability of accepting poor men would
be .20.

Referring to Table 1, these figures give an "A" of 16.0, and a "B"
of .211.

b. Give one test item to the first man tested. The first man
tested could be given any one of the ten items. (Suppose that
item number five were used first, and the man made a raw score
of two on it.)




c. Determine the Ds value of the raw score made on the item just
taken. (The Ds value of a raw score of two on item number five
from the preceding tables is 2.17.)

d. Compare this Ds value with the A and B determined in step "a"
above. Stop testing and reject the man if the Ds value is
equal to or greater than A. Stop testing and accept him if
the Ds value is equal to or smaller than B.

(Since 2.17 is between 16 and .211, we would continue testing our man.)

e. If the decision is to continue testing, administer a second item,
and repeat step "c" above. (Suppose that item number three was
administered, and a raw score of five obtained. This would
give a Ds value of 2.05.)

f. Multiply the Ds values for the first and second items taken,
and repeat step "d" above. (2.05 times 2.17 is 4.45, so the
decision is to continue testing.)

g. If additional items are needed, administer them one at a time.
Multiply the Ds value for the latest item taken by the result
of all previous multiplications of Ds values. After each item,
repeat step "d" above. (Suppose that the third item administered
was item number four, and a raw score of zero was obtained.
This has a Ds value of 4.40. 4.40 times 4.45 (obtained in step "f")
is 19.58, so the man is rejected and testing stopped.)
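
The steps above, including the conversion of raw scores to Ds values, can
be sketched as follows. The conversion tables hold only the entries needed
for this illustration, and the names CONVERSION, ds_for, and run_test are
hypothetical.

```python
# Partial conversion tables: item number -> list of
# ((lowest raw score, highest raw score), Ds value).
# Only the entries needed for this illustration are included.
CONVERSION = {
    5: [((0, 3), 2.17), ((5, 12), 0.79), ((18, 30), 0.0)],
    3: [((1, 6), 2.05), ((8, 10), 0.19)],
    4: [((0, 0), 4.40), ((2, 2), 1.57), ((4, 6), 0.94)],
}


def ds_for(item, raw):
    """Look up the Ds value of a raw score on a given item (step "c")."""
    for (low, high), ds in CONVERSION[item]:
        if low <= raw <= high:
            return ds
    raise KeyError(f"no Ds value listed for raw score {raw} on item {item}")


def run_test(responses, a, b):
    """Apply steps "c" through "g" to a list of (item, raw score) pairs."""
    product = 1.0
    for n, (item, raw) in enumerate(responses, start=1):
        product *= ds_for(item, raw)
        if product >= a:
            return "reject", n, product
        if product <= b:
            return "accept", n, product
    return "continue", len(responses), product


# The man in the example: item five (raw 2), item three (raw 5),
# item four (raw 0), with A = 16.0 and B = .211.
decision, n_items, product = run_test([(5, 2), (3, 5), (4, 0)],
                                      a=16.0, b=0.211)
print(decision, n_items, round(product, 2))
```

The man is rejected on the third item, since the running product first
reaches A at that point.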


Suggested Modification of Scoring Performance Items
During Routine Test Administration

Cronbach has suggested that fewer errors are apt to result during
routine test administration if logarithms of A, B, and Ds values are used.
If this procedure is employed, Ds values can be added instead of being
multiplied. A simple experiment with personnel of the type who will
administer the performance test should quickly indicate whether errors
of multiplication or errors of addition (using positive and negative
numbers) are most important.
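
A minimal sketch of the logarithmic form of the procedure is given below.
The function name is hypothetical, common (base 10) logarithms are
assumed, and a Ds value of zero is treated as an immediate acceptance
because its logarithm is undefined while the product it would give is
zero.

```python
import math


def sequential_decision_logs(ds_values, a, b):
    """Sequential scoring by adding log Ds and comparing with log A, log B.

    Assumes A and B are positive. A Ds of zero has no logarithm; since it
    would make the product zero, it is treated here as an acceptance.
    """
    log_a, log_b = math.log10(a), math.log10(b)
    total = 0.0
    for n, ds in enumerate(ds_values, start=1):
        if ds == 0:
            return "accept", n
        total += math.log10(ds)        # addition replaces multiplication
        if total >= log_a:
            return "reject", n
        if total <= log_b:
            return "accept", n
    return "undecided", len(ds_values)


print(sequential_decision_logs([2.17, 2.05, 4.40], a=16.0, b=0.211))
```

In routine use the logarithms would be entered in the conversion table
ahead of time, so that the test administrator only adds and compares.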


Estimating an Operating Characteristics Curve

The operating characteristics curve is a graphic representation of
the efficiency of any test. Its use was described in Chapter II.

For most practical purposes, the operating characteristics curve for
a sequential sampling plan can be determined from four points. Two of
these, Pag at mg and Pap at mp, have been determined previously. The other
two are established by the facts that people who have zero true ability
will never be accepted, and that people who have perfect true ability will
always be accepted.


If additional points on the OC curve are needed for increased accuracy,
they may be obtained through a process outlined in Sequential Analysis of
Statistical Data: Applications, 1945, Columbia University Press, pages 4.19.
This process consumes a considerable amount of time. It is usually
unnecessary to graph even the simplified OC curve shown here except when
the effects of choosing different mg, mp, Pag and Pap values are to be
compared.




Computation of Average Sample Size Required

The complete average sample size curve for a given sequential sampling
plan can be determined from formulae described on pages 4.20-4.23 of the
Columbia University publication described above. For ordinary purposes,
however, the average sample size can be determined empirically from data
provided by the standardization group.


The procedure for this determination is as follows:


a. Complete the administration of the test to the standardization
group, and calculate A and B from your determination of Pap and Pag.

b. Consider one man at a time from the standardization group.
Multiply cumulatively the Ds values he obtains on each test item.
Starting with item number one, go through each of the items he
took, and determine the item on which he was first failed or
first accepted. If this was on the fifth item he took, record
the number five.

c. Do the same for each man, and determine the average number of
items it took to reach a decision.

You should be prepared to administer about three times the average
number of items required by the standardization group. If this value is
too large, lower A or raise B.


For example, if John Doe, in your standardization group, had the
following scores:


 Sequence in                     Cumulative
 Which Items         Ds          Multiplication
 Were Taken         Value        of Ds Values

      1              2.1              2.1
      2               .7              1.47
      3              3.4              5.14
      4              2.7             13.88
      5              4.0             55.5
      6              4.0            222.1
      7               .3             66.6
      8              2.0            133.2
      9              1.7            226.4
     10              3.4            769.8
     11              1.7           1308.7
     12              1.8           2355.7

If A were 20, and B were .05, this person would be rejected on the
fifth item.


Note that the average sample size will increase if A and B are farther
apart, and will decrease if A and B are closer together.


The  above  procedure  will  be  satisfactory  if  items  are  administered  in 
the  same  order  as  was  used  in  the  standardization  procedure.  It  will 
probably  be  satisfactory,  even  if  the  order  of  administration  is  changed, 
provided  that  each  of  the  items  has  approximately  the  same  discrimination  value. 
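
The empirical determination described above can be sketched in Python.
The function names are hypothetical. The first man's Ds values are those
of the John Doe example; the other two men are invented for illustration.
Counting a man who never reaches a decision as having used every item is
an assumption of the sketch.

```python
def items_to_decision(ds_values, a, b):
    """Number of the item on which a man is first failed or accepted."""
    product = 1.0
    for n, ds in enumerate(ds_values, start=1):
        product *= ds
        if product >= a or product <= b:
            return n
    # Assumption: an undecided man is counted as using every item.
    return len(ds_values)


def average_sample_size(group, a, b):
    """Average number of items needed to reach a decision in a group."""
    counts = [items_to_decision(ds_values, a, b) for ds_values in group]
    return sum(counts) / len(counts)


john_doe = [2.1, 0.7, 3.4, 2.7, 4.0, 4.0, 0.3, 2.0, 1.7, 3.4, 1.7, 1.8]
others = [[0.2, 0.2], [1.5, 0.6, 0.05]]   # hypothetical men

average = average_sample_size([john_doe] + others, a=20.0, b=0.05)
print(items_to_decision(john_doe, 20.0, 0.05), round(average, 2))
```

With A = 20 and B = .05, John Doe is decided on the fifth item and the two
invented men on the second and third, giving an average of about 3.3
items.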




Alternative Procedure for Calculating
Average Sample Size *

A considerably more accurate, but slightly more involved procedure
would involve the computation of the geometric mean of Ds values for each
man, and a determination of the power to which this mean would have to be
raised in order to approximate the A value used for the poor group, and
the B value used for the good group. (This procedure is not applicable
if any of the smoothed Ds values are zero or infinity.)

a. Complete the administration of the test to the standardization
group, and calculate A and B from your determination of Pap and Pag.

b. Determine log A and log B.

c. Determine the log of each Ds value in your smoothed standardization
data.

d. Calculate the log of the geometric mean of Ds values earned by
the poor group.

$$\log(\text{geometric mean of Ds values for poor group}) = \frac{\sum \log Ds \ (\text{poor group})}{(\text{No. of men in poor group}) \times (\text{No. of items per man})}$$

e. Calculate the log of the geometric mean of Ds values earned by
the good group.

$$\log(\text{geometric mean of Ds values for good group}) = \frac{\sum \log Ds \ (\text{good group})}{(\text{No. of men in good group}) \times (\text{No. of items per man})}$$

f. Divide log A by the value obtained in "d" above. This is the
average number of items required to fail a poor man.

g. Divide log B by the value obtained in "e" above. This is the
average number of items required to accept a good man.

h. Multiply the value obtained in "f" above by the proportion of
poor men. (If you have 10 poor men and 40 good men, the
proportion of poor men is .20; disregard the "indifferent" men.)

i. Multiply the value obtained in "g" above by the proportion of
good men.

j. Add the values obtained in "h" and "i" above. You should be
prepared to administer approximately three times this number of
test items to some men. If this value is too large, lower A
or raise B.
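
Steps "a" through "j" can be sketched as follows. The function names and
the Ds values are hypothetical, chosen so that the arithmetic is easy to
follow. The sketch assumes that every man took the same number of items,
that no Ds value is zero or infinity, and that the poor group's mean log
Ds is positive and the good group's is negative.

```python
import math


def mean_log_ds(group):
    """Log (base 10) of the geometric mean of all Ds values in a group.

    group: list of per-man lists of Ds values. Dividing by the count of
    values equals dividing by men x items when all men took all items.
    """
    logs = [math.log10(ds) for man in group for ds in man]
    return sum(logs) / len(logs)


def expected_items(poor_group, good_group, a, b):
    """Weighted average number of items needed to reach a decision."""
    items_to_fail = math.log10(a) / mean_log_ds(poor_group)   # step "f"
    items_to_pass = math.log10(b) / mean_log_ds(good_group)   # step "g"
    n_poor, n_good = len(poor_group), len(good_group)
    share_poor = n_poor / (n_poor + n_good)                   # step "h"
    return share_poor * items_to_fail + (1 - share_poor) * items_to_pass


# Hypothetical standardization data: 2 poor men and 8 good men,
# two items each, with A = 16 and B = .25.
poor_group = [[4.0, 1.0], [2.0, 2.0]]
good_group = [[0.25, 1.0]] * 4 + [[0.5, 0.5]] * 4

estimate = expected_items(poor_group, good_group, a=16.0, b=0.25)
print(round(estimate, 2))
```

In this invented case the geometric mean Ds is 2 for the poor group and
.5 for the good group, so about four items are needed to fail a poor man
and two to accept a good man, for a weighted average of 2.4 items.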


* This procedure, based on information theory, is suggested by
Cronbach.




Determination of Minimum Number of Items
Required to Pass or Fail a Testee

To determine the minimum number of items necessary to fail a student,
arrange the five or six highest Ds values in rank order. If the largest
Ds value is larger than A, a person can be failed after taking only one
item. If the largest Ds value is smaller than A, multiply it by the
second largest, and again compare with A. Continue until a value as
large or larger than A is obtained. The minimum number of items necessary
to fail a student is equal to the number of Ds values multiplied together
to reach or exceed A.

To determine the minimum number of items necessary to pass a student,
arrange the five or six lowest Ds values in rank order. If the smallest
Ds value is smaller than B, a person can be passed after taking only one
item. If the smallest Ds value is larger than B, multiply it by the next
smallest, and again compare with B. Continue until a value as small as or
smaller than B is obtained. The minimum number of items required to pass a
student is equal to the number of Ds values multiplied together to reach
a value no greater than B.
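
The two rules above amount to a short running-product loop. The sketch below is illustrative only: the function names and the Ds values are hypothetical, and each list is assumed to hold the most extreme Ds value attainable on each item, so that no item is counted twice.

```python
def min_items_to_fail(highest_ds, A):
    """Multiply the largest Ds values together until the product reaches A."""
    product = 1.0
    for count, d in enumerate(sorted(highest_ds, reverse=True), start=1):
        product *= d
        if product >= A:          # as large as or larger than A
            return count
    return None                   # A cannot be reached with these items

def min_items_to_pass(lowest_ds, B):
    """Multiply the smallest Ds values together until the product falls to B."""
    product = 1.0
    for count, d in enumerate(sorted(lowest_ds), start=1):
        product *= d
        if product <= B:          # as small as or smaller than B
            return count
    return None                   # B cannot be reached with these items

# Hypothetical Ds values, with the A and B used in the Appendix.
print(min_items_to_fail([4.0, 3.0, 2.5, 2.0, 1.5], A=16))
print(min_items_to_pass([0.3, 0.4, 0.5, 0.6], B=0.211))
```

A return value of None signals that the supplied Ds values cannot carry the product across the limit, which would suggest that A or B has been set too stringently for the items available.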


Summary


Sequential sampling appears to be useful in testing whenever:

1. Testing time per test item is high in relation to the time
required to score each test item, and

2. The test is primarily designed to determine whether a person
"passes" or "fails", and

3. There is a need for testing more than about one hundred persons
on the same test, either in one group or in a number of groups, and

4. There is negligible correlation between items when criterion
scores are held constant.


Sequential sampling takes item difficulty and item discrimination
into account when discrimination score values (norms) are established.

Sequential  sampling  can  readily  be  adapted  to  changing  standards  of 
accepting  people,  with  no  revision  of  the  norms  previously  set  up. 

Ordinarily,  sequential  sampling  will  give  about  the  same  accuracy 
as  a fixed  length  test,  with  about  half  of  the  testing  time,  and  about 
half  of  the  testing  cost. 




APPENDIX

Results of Sequential Sampling Compared with
Administration of a Fixed Length Test


As a check on the efficiency of sequential sampling, the Ds values
obtained from the standardization process described in the last chapter
were applied in sequential fashion to the scores obtained. The following
results were obtained, using A = 16, B = .211:


MAN    TOTAL       CLASSIFICATION   CLASSIFICATION   NO. OF
NO.    RAW SCORE   ON TOTAL RAW     ON SEQUENTIAL    SEQUENTIAL ITEMS
                   SCORE            SCORE            REQUIRED

47     148         good             pass             1
46     132         "                "                1
24     130         "                "                5
19     123         "                "                4
43     115         "                "                1
36     114         "                "                7
10     110         "                "                2
27     104         "                "                7
50     102         "                "                2
35      98         "                "                5
 4      96         "                "                1
 8      93         "                "                3
 5      90         "                "                7
41      89         "                "                4
51      88         "                "                6
14      88         "                "                7
48      87         "                "                4
32      84         "                "                5
23      80         "                "                7
 6      76         "                "                7
16      74         "                "                4
17      74         "                "                5
44      74         "                "                1
39      73         "                "                4
20      72         "                "                10
21      70         "                "                5
29      69         "                fail             7
49      66         "                pass             3
11      66         "                "                3
55      65         "                "                9
38      62         "                fail             3
15      62         "                pass             10
34      61         "                "                2
57      60         "                "                2
 1      59         "                "                [illegible]
26      59         "                fail             5
42      59         good             pass             5
 7      56         indifferent      "                7
 3      56         "                "                10
40      56         "                fail             7
37      56         "                pass             3
2?      55         "                fail             6
15      54         "                pass             8
52      54         "                fail             6
13      51         "                pass             1
30      47         "                "                5
 2      45         "                "                8
25      41         "                [illegible]      9
54      36         poor             "                4
56      36         "                "                7
 5      34         "                "                4
12      34         "                "                8
31      30         "                "                0
33      29         "                "                7
53      27         "                "                6
45      25         "                "                3
28      24         "                "                5


The average number of items required to reach a decision in
sequential sampling was only 4.42 instead of 10, a saving of over three
hours in average performance testing time. The biserial correlation
between original total raw score and sequential pass-fail was .83.

However, it should be noted that this correlation is somewhat contaminated,
since it was computed on the same data used to standardize the test. In
order to be verified, it should be recomputed on data not used for
standardization of the test.
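
The sequential check described in this appendix can be sketched as follows. The sketch assumes the decision rule used throughout this report (fail when the running product of Ds values reaches A, pass when it falls to B); the function name, the "undecided" outcome for a testee who exhausts the items, and the sample Ds sequences are hypothetical and are not taken from the data above.

```python
def sequential_decision(ds_sequence, A=16.0, B=0.211):
    """Apply Ds values in order of administration.

    Returns (decision, number of items used). The running product is
    compared with A and B after every item.
    """
    product = 1.0
    for n, d in enumerate(ds_sequence, start=1):
        product *= d
        if product >= A:
            return "fail", n
        if product <= B:
            return "pass", n
    # Assumption: no rule is stated here for a testee who exhausts the items.
    return "undecided", len(ds_sequence)

# Hypothetical Ds sequences for three testees.
testees = [[0.5, 0.6, 0.5], [3.0, 2.0, 4.0], [1.0, 1.0]]
results = [sequential_decision(ds) for ds in testees]
average_items_used = sum(n for _, n in results) / len(results)
print(results, average_items_used)
```

Run on held-out data, the same loop would give the uncontaminated comparison called for above: the decisions can be set against total raw score, and the average of the item counts against the fixed test length of 10.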


